Abstract



Model selection is a crucial issue in machine- 
learning and a wide variety of penalisation methods 
(with possibly data dependent complexity penal- 
ties) have recently been introduced for this purpose. 
However their empirical performance is generally 
not well documented in the literature. It is the goal 
of this paper to investigate to which extent such 
recent techniques can be successfully used for the 
tuning of both the regularisation and kernel param- 
eters in support vector regression (SVR) and the 
complexity measure in regression trees (CART). 
This task is traditionally solved via V-fold cross-validation (VFCV),
which gives efficient results for a reasonable computational cost. A
disadvantage of VFCV, however, is that the procedure is known to
provide an asymptotically suboptimal risk estimate as the number of
examples tends to infinity. Recently, a penalisation procedure called
V-fold penalisation has been proposed to improve on VFCV, supported by
theoretical arguments. Here we report on an extensive set of
experiments comparing V-fold penalisation and VFCV for SVR/CART
calibration on several benchmark datasets. We highlight cases in which
VFCV and V-fold penalisation, respectively, provide poor estimates of
the risk, and introduce a modified penalisation technique to reduce
the estimation error.



Author for correspondence (charanpal.dhanjal@lip6.fr)



1 Introduction 

Learning algorithms generally depend on a small 
number of real-valued or discrete parameters such 
as the size of a tree in hierarchical methods, the 
stopping criteria in boosting algorithms or explicit 
regularisation/smoothing parameters. These pa- 
rameters naturally determine the complexity of the 
output function, and by doing so, also strongly in- 
fluence generalisation ability. In a general sense the 
more "complex" the learnt function is, the more 
likely it is to overfit to the data. On the contrary, 
a simple predictor will be suboptimal if the data 
is informative with regard to the learning prob- 
lem. From the model selection point of view the 
challenge consists in selecting values of the param- 
eters of interest with a theoretical risk as small 
as possible. From a global perspective, there ex- 
ist essentially two major approaches to model se- 
lection: methods related to data-splitting, with 
cross-validation [1] and its variants, and methods 
related to penalisation of the empirical risk (that 
obtained on the training set), with in particular 
the Structural Risk Minimisation principle [36]. 
Penalisation-based approaches aim to approximate 
the ideal model by adding a penalty or complexity- 
based term to the empirical risk, generally based 
on theoretical arguments (i.e. on probabilistic 
distribution-free upper bounds for the excess of 
risk). V-fold cross-validation (VFCV for short) is widely used in
machine-learning practice due to its (relative) computational
tractability and empirical evidence of its good behaviour. However,
there is little theoretical justification concerning VFCV [3] and the
question of an automatic choice of the parameter V remains widely
open [8]. On the other hand, penalisation procedures generally suffer
from two possible drawbacks, despite the fact that they are
theoretically well-founded: either they are designed in a simplified
framework and thus are not robust to complex situations (typically the
case for Mallows' $C_p$ in regression [26]), or they are of general
purpose, as for instance the so-called Rademacher complexities [4],
but are inaccurate in many cases. In order to refine Rademacher
penalties, several authors proposed localisation techniques, giving
rise to local Rademacher complexities [23], but these more accurate
capacity functions are essentially of theoretical interest and cannot
be used in practice due to the presence of unknown constants in their
definition.

Combining both the robustness of cross-validation estimates and the
theoretical guarantees of penalisation procedures, a new type of
general purpose penalisation procedure, called V-fold penalisation,
has recently been proposed [2]. Both empirical and mathematical
evidence of its efficiency have been shown in a heteroscedastic
regression framework with random design, when considering the
selection of finite-dimensional histogram models. The selection of
regressograms studied in [2] is convenient for theory, since it allows
precise mathematical investigations while remaining general enough to
exhibit some relevant complex phenomena. In this paper we instead
investigate the behaviour of V-fold penalties, and compare it to VFCV,
for the tuning of the hyperparameters involved in the Support Vector
Regression algorithm (SVR, [14]) and Classification and Regression
Trees (CART, [6]) for regression. Indeed, these algorithms are two of
the most extensively used regression tools in a wide variety of areas
and the choice of efficient hyperparameters is known to be a decisive
step of the learning process to attain good generalisation
performance. Model selection for SVR has been addressed by several
authors and many theoretically well-founded approaches have been
proposed, among which are estimation of the hyperparameters from the
data and the level of noise [34] [24] [12] and leave-one-out bounds
for SVR [11]. However, methods based on resampling procedures for the
evaluation of the risk of each model have been proven to be
significantly better than most of the other proposed automatic
procedures [12, 29] and VFCV is generally the chosen method in
practice [35]. For CART regression, the issue of model selection has
not received as much attention; however, [18] provides a theoretical
validation of the standard CART pruning criterion. In this paper, our
aim is to study V-fold penalisation for model selection and give
insights into situations when one can improve on VFCV in practice. In
particular, the comparison of V-fold penalisation with VFCV on the
problem of SVR and CART calibration is important, given the prominence
of VFCV for this central issue.

The remainder of this paper is organised as follows: Section 2
describes the statistical framework related to model selection for
kernel SVR and CART. In Section 3 we recall VFCV and related work, and
in Section 4 we introduce V-fold penalisation and our improved
penalisation approach. Experiments are addressed in Section 6, and
conclusions are presented in Section 8.

2 Background and Preliminaries

As a first step, we outline the statistical setting of the model
selection problem we shall subsequently study (generally referred to
as the distribution-free regression setup). Here and throughout, a
column vector is written in bold lowercase, e.g. $\mathbf{x}$. Let
$\mathcal{X} \times \mathcal{Y}$ be a measurable space endowed with an
unknown probability measure $P$, with $\mathcal{Y} = \mathbb{R}$.
$\mathcal{X}$ is called the input space and is usually a compact
subset of $\mathbb{R}^d$, $d \geq 1$, and $\mathcal{Y}$ is the target
space. We observe $n$ i.i.d. labelled observations or examples
$S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\} \subset \mathcal{X} \times \mathcal{Y}$.
Furthermore, $(X, Y)$ denotes a generic random variable, independent
from the data $S$, drawn from $P$. Let $\mathcal{S}$ be the set of all
measurable functions $s : \mathcal{X} \to \mathcal{Y}$ mapping from the
input to target space. In the present paper, focus is on the mean
absolute deviation:

$$L(s) = E[|s(X) - Y|].$$

The regression task can thus be rewritten in this notation as finding
the minimum of the so-called least absolute Bayes loss $s^*$, defined
by:

$$L(s^*) = \min_{s \in \mathcal{S}} L(s).$$

2.1 Support Vector Regression 

The SVR prediction function is of the form
$f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b$ where
$\mathbf{w} \in \mathbb{R}^d$ is a weight vector and $b$ is a constant.
In this case, one is interested in errors greater than a user-defined
value $\epsilon \in \mathbb{R}^+$ (known as the $\epsilon$-insensitive
loss). Hence the optimisation task can be written as:

$$t = \arg\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$
$$\text{s.t.} \quad y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle - b \leq \epsilon + \xi_i,$$
$$\langle \mathbf{w}, \mathbf{x}_i \rangle + b - y_i \leq \epsilon + \xi_i^*,$$
$$\xi_i, \xi_i^* \geq 0, \quad i = 1, \ldots, n,$$

where $\xi_i$ and $\xi_i^*$ are slack variables, and $C$ is a
user-defined trade-off between minimising the norm of the weight
vector $\mathbf{w}$ (which can be seen as regularisation) and
penalising errors greater than $\epsilon$. A high value of $C$ thus
corresponds to a low regularisation level and the objective then
becomes closer to that of minimising the empirical risk. The value of
$\epsilon$ affects the number of Support Vectors (SVs for short), with
larger values resulting in fewer SVs. In a slight abuse of notation,
the minimising values of $\mathbf{w}$ and $b$ form a prediction
function $t$. The SVR algorithm is often performed using kernels to
model non-linear functions, where a kernel function
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used to find the
inner product of the transformation of the input space $\mathcal{X}$
into its associated Reproducing Kernel Hilbert Space (RKHS), denoted
by $\mathcal{H}_K$. Note that $k$ can be written in terms of a
transformation $\phi$ from input to kernel space,
$\langle \mathbf{u}, \mathbf{v} \rangle_{\mathcal{H}_K} = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle = k(\mathbf{u}, \mathbf{v})$.
Kernel functions usually depend on one or a few hyperparameters, e.g.
polynomial, Gaussian Radial Basis Function (RBF) and sigmoid kernels
[32]. The Gaussian RBF kernel is one of the most commonly used kernels
and Boser, Guyon and Vapnik suggested its widespread use [5] [19]
[37]. In the experiments described in Section 6, we therefore consider
the Gaussian RBF kernel,

$$k_\gamma(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2),$$

which depends on one real-valued positive parameter $\gamma$. In the
following we denote the Gaussian RBF kernel by $k_\gamma$,
$\gamma \in \mathbb{R}^+$. The optimal value of the regularisation
parameter $C$ can change significantly, and depends on the data. To
ensure the prediction performance of the SVR algorithm, the
regularisation parameter, as well as the kernel, should thus be
calibrated in each application. Formally, the question to be addressed
is to find the best parameters $(\gamma, C, \epsilon)$ in terms of
prediction. We thus aim at estimating the oracle, which is the model
with the smallest risk (where $t(\gamma, C, \epsilon)$ is the SVR
learnt using parameters $\gamma, C, \epsilon$),

$$\arg\min_{(\gamma, C, \epsilon) \in \mathbb{R}^3} L(t(\gamma, C, \epsilon)),$$

which is unknown since it depends on the law $P$ of data and which
optimises the least squares error.
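As an illustration of how one candidate triple $(\gamma, C, \epsilon)$ is turned into a predictor and scored, the following is a minimal sketch using the scikit-learn `SVR` class. The synthetic sine data, the helper names `fit_rbf_svr` and `mean_absolute_error`, and the two values of $C$ are assumptions made for the example; they are not the datasets or code of the experiments.

```python
import numpy as np
from sklearn.svm import SVR


def fit_rbf_svr(X, y, gamma, C, epsilon):
    """Fit an RBF-kernel SVR for one hyperparameter triple (gamma, C, epsilon)."""
    model = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=epsilon)
    return model.fit(X, y)


def mean_absolute_error(model, X, y):
    """Empirical mean absolute deviation of the fitted model on (X, y)."""
    return float(np.mean(np.abs(model.predict(X) - y)))


# Hypothetical toy problem: a noisy sine curve, split into train and test halves.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]

# A very small C regularises heavily; a larger C fits the data more closely.
errors = {}
for C in (2.0 ** -10, 2.0 ** 2):
    model = fit_rbf_svr(X_train, y_train, gamma=2.0 ** -2, C=C, epsilon=2.0 ** -4)
    errors[C] = mean_absolute_error(model, X_test, y_test)
```

On this toy problem the heavily regularised model is nearly constant, so its test error is much larger, which is the kind of sensitivity to $C$ that motivates calibrating the hyperparameters for each application.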

2.2 CART Regression 

Another important algorithm for regression is CART, in which one
learns a tree like the one exemplified in Figure 1. To regress a new
example $\mathbf{x}$, it is filtered down to a leaf node via a decision
at each link and then assigned a real number target. In the
illustration, if the first feature of $\mathbf{x}$ satisfies
$x_{(1)} \leq \theta_1$, for some threshold $\theta_1$, then it is
labelled 0.0. Similarly, if $x_{(1)} > \theta_1$ and
$x_{(2)} > \theta_3$ then the example is labelled $-0.3$. Regression
trees have been successfully used in a variety of applications, such
as vector quantisation [13], meteorology [10] and medicine [38], and
have the crucial advantage of being easy to interpret and easy to
compute.

Figure 1: An example of a decision tree.

To construct a regression tree one starts with the root node which
contains all of the training examples $S_0 = S$. One then decides how
to split the examples based on a feature $k$ and a threshold $\theta$.
Given a choice of these values, the left child contains the examples
$S_L = \{(\mathbf{x}, y) \in S_0 \mid x_{(k)} \leq \theta\}$ and the
right one is $S_R = \{(\mathbf{x}, y) \in S_0 \mid x_{(k)} > \theta\}$.
The optimal feature-threshold pair $k^*, \theta^*$ is found by
minimising the squared error of the split, i.e. $k^*, \theta^*$ are
found using:

$$\arg\min_{k, \theta} \sum_{(\mathbf{x}, y) \in S_L} (y - \mu_{S_L})^2 + \sum_{(\mathbf{x}, y) \in S_R} (y - \mu_{S_R})^2$$
$$\text{s.t.} \quad S_L = \{(\mathbf{x}, y) \in S_0 \mid x_{(k)} \leq \theta\},$$
$$S_R = \{(\mathbf{x}, y) \in S_0 \mid x_{(k)} > \theta\},$$

where $\mu_{S_L} = \frac{1}{|S_L|} \sum_{(\mathbf{x}, y) \in S_L} y$ and
$\mu_{S_R} = \frac{1}{|S_R|} \sum_{(\mathbf{x}, y) \in S_R} y$ are the
mean labels for the left and right nodes and hence $(y - \mu_S)^2$ is
the squared error between $y$ and the mean label. A simple way to
solve this optimisation is to iterate through each feature and
threshold and choose the one with the lowest objective value. After
splitting on the root node, one recursively splits on the resulting
child nodes until no more splits are possible, i.e. a node contains
fewer examples than a user-defined value. Following the growing phase,
one prunes the resulting tree, as smaller trees have been shown to
improve generalisation error. In CART, one uses an approach called
cost complexity pruning which generates a series of trees pruned from
the original tree and then selects one of the trees in the sequence.
For the $i$th node in the unpruned tree, which contains examples
$S_i$, one computes the error if the tree was pruned at that node and
compares it to the error if the subtree $R_i$ starting at that node is
kept. The difference in these errors divided by the number of leaves
of the subtree gives an indication of the error difference per leaf,
i.e.

$$\alpha_i = \frac{L(S_i, \bar{R}_i) - L(S_i, R_i)}{|R_i|_l - |\bar{R}_i|_l},$$

in which $L(S_i, R_i)$ is the squared error of a set of examples $S_i$
using subtree $R_i$, $\bar{R}_i$ is the root of $R_i$, and $|R_i|_l$ is
the number of leaves in $R_i$. In simple terms, the higher the value
of $\alpha_i$, the bigger the reduction in error of the subtree per
leaf. One can compute $\alpha_i$ for all nodes in the tree and hence,
if we prune nodes with $\alpha_i$ smaller than or equal to a threshold
$\alpha \in \{\alpha_1, \ldots, \alpha_{|T|}\}$, where $|T|$ is the
number of nodes in the tree, we obtain a sequence of trees which
decrease in size as $\alpha$ increases. Instead of choosing $\alpha$
directly in the model selection stage, we pick a threshold $t$ and
choose the largest tree smaller or equal in size to $t$. Therefore, as
before, the model selection task can be written in terms of a search
for the parameter $t$. We aim at estimating the oracle (where $f(t)$
is the decision tree learnt using parameter $t$),

$$\arg\min_{t} L(f(t)).$$

Estimating the oracle is a model selection task, each 
model being represented here by a fixed value of t, 
where penalisation is a natural way to proceed, as 
explained in Section 4 below. However, let us first 
briefly recall the method which is usually employed 
for model selection, namely V-fold cross-validation. 
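The exhaustive search over feature-threshold pairs described in Section 2.2 can be sketched as follows. This is a minimal illustration, not the scikit-learn implementation used in the experiments: the function name `best_split`, the use of observed feature values as candidate thresholds, and the convention that the left child takes examples with $x_{(k)} \leq \theta$ are assumptions made for the example.

```python
import numpy as np


def best_split(X, y):
    """Return (feature k, threshold theta, squared error) of the best split.

    Every observed feature value is tried as a threshold, with the left
    child holding examples whose feature k is <= theta (an assumption).
    """
    best = (None, None, np.inf)
    for k in range(X.shape[1]):
        values = np.unique(X[:, k])
        # Skip the largest value: it would leave the right child empty.
        for theta in values[:-1]:
            left = X[:, k] <= theta
            y_left, y_right = y[left], y[~left]
            error = (np.sum((y_left - y_left.mean()) ** 2)
                     + np.sum((y_right - y_right.mean()) ** 2))
            if error < best[2]:
                best = (k, float(theta), float(error))
    return best


# Hypothetical toy data: the label depends only on the first feature.
X = np.array([[0.0, 5.0], [1.0, 3.0], [2.0, 8.0], [3.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
k, theta, error = best_split(X, y)
```

Here the search selects the first feature with threshold 1.0, which separates the two label values exactly. Growing a full tree would apply the same routine recursively to each child.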

3 V-fold Cross Validation

The idea of cross-validation for model selection is 
to estimate the risk of the considered estimator 
on each model by using a repeated data-splitting 
scheme, and then to select the model that min- 
imises these estimates of the risk. The fact that 
data-splitting strategies give accurate estimates of 
the risks only relies on the independence between 
each training and testing set. Consequently, the 
interest of CV is that it is based on a heuristic 
that can be applied with great universality. Many 
data-splitting rules have been proposed, such as 
leave-one-out (LOO, [1]), leave-p-out (LPO, [33]), 
balanced incomplete CV (BICV, [33]), repeated 
learning-testing (RLT, [7]). V-fold cross-validation (VFCV) was
introduced by Geisser [17] (see also [7]) as a computationally
efficient alternative to LOO cross validation. We will consider
primarily VFCV, which is certainly the most commonly used
cross-validation rule in practice. Moreover, it is generally the
procedure which is considered for the calibration of the SVR
hyperparameters [35] [21]. In VFCV, the examples are partitioned into
V subsamples $B_1, \ldots, B_V$ of n/V examples each (with a maximal
deviation of one). At the jth fold one trains on $S \setminus B_j$ and
then evaluates the error on $B_j$, and one averages the errors over
all V folds.
For model selection this is repeated over a grid of parameters in
order to select those with the lowest error. Despite the generality of
the heuristic underlying the procedure, there are two drawbacks in the
V-fold cross-validation method for model selection. First, at a fixed
V, the procedure is known to be asymptotically suboptimal in the sense
that the risk of the selected estimator is not equivalent to the risk
of the oracle when the number of data tends to infinity. More
precisely, [2] shows, in a heteroscedastic regression framework with
random design, that VFCV with fixed V satisfies an oracle inequality
with a constant A > 1 which relates the excess risk of the selected
model to the excess risk of the oracle, and that this suboptimal
constant cannot be improved asymptotically. The keystone of such a
result is that at a fixed V, the VFCV criterion is biased compared to
the true risk [9] [31]. Indeed, since the validation sets are
independent of their respective training sets, the expectation of the
VFCV criterion, $E[\mathrm{crit}_{VFCV}(s(Q))]$ for a learner $s$ with
parameter set $Q \in \mathcal{Q}$, can be related to the expectation of
the true risk as follows,

$$E\left[L_S^{(j)}\left(s^{(-j)}(Q)\right)\right] = E\left[L\left(s^{(-1)}(Q)\right)\right],$$

where $s^{(-1)}(Q)$ is the output of the learner trained with
$(1 - 1/V)n$ i.i.d. examples, $s^{(-j)}(Q)$ is the learner trained with
$S \setminus B_j$, and $L_S^{(j)}$ is the loss with respect to the
partition $B_j$ of $S$. Since the true risk generally decreases with
more data, it appears that the expectation of the VFCV criterion tends
to overestimate the expectation of the true risk, and that this bias
decreases as V increases. The previous observation suggests that, in
order to mimic the oracle in terms of prediction performance, one
should take V as large as possible. This is where the second drawback
of VFCV appears: there is no rule in practice to choose the optimal V.
Indeed, the best CV estimator of the risk is not necessarily the best
model selection procedure, and [8] highlights that LOO is the best
estimator of the risk, whereas 10-fold cross-validation is more
efficient for model selection purposes. This can be explained by the
fact that the bias in the VFCV estimation of the risk is actually an
advantage for model selection with a small or moderate number of
examples, contrary to the asymptotic framework. Indeed, as claimed in
[2], a slightly over-pessimistic estimation of the risk, as in VFCV,
gives for a fixed number of observations a more robust model selection
procedure and roughly counteracts the bad effects of the variance of
risk estimation.
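The VFCV criterion and the grid selection it drives can be sketched as below. To keep the example self-contained, the learner is a polynomial least-squares fit whose degree plays the role of the hyperparameter $Q$; this stand-in, the helper names, and the synthetic data are assumptions for illustration and are not the SVR or CART learners studied here.

```python
import numpy as np


def vfold_blocks(n, V, rng):
    """Partition range(n) into V blocks whose sizes differ by at most one."""
    return np.array_split(rng.permutation(n), V)


def vfcv_criterion(fit, X, y, V, rng):
    """Average over the V folds of the mean absolute error on the held-out block."""
    n = len(y)
    fold_errors = []
    for block in vfold_blocks(n, V, rng):
        train = np.setdiff1d(np.arange(n), block)
        predict = fit(X[train], y[train])  # learner trained on S \ B_j
        fold_errors.append(np.mean(np.abs(predict(X[block]) - y[block])))
    return float(np.mean(fold_errors))


def poly_learner(degree):
    """Hypothetical stand-in learner: least-squares polynomial of a given degree."""
    def fit(x, y):
        coeffs = np.polyfit(x, y, degree)
        return lambda x_new: np.polyval(coeffs, x_new)
    return fit


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = x ** 2 + 0.05 * rng.standard_normal(60)

# Re-seeding the generator gives every candidate the same fold partition.
scores = {d: vfcv_criterion(poly_learner(d), x, y, V=5,
                            rng=np.random.default_rng(1))
          for d in range(4)}
selected = min(scores, key=scores.get)
```

Using one fixed partition for all candidates is a design choice that makes the criterion values directly comparable across the grid.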



4 V-fold Penalisation

Penalisation is a natural strategy for the task of estimating the
oracle $s(Q^*)$. Indeed, the definition of the oracle can be rewritten
as the sum of the empirical loss and an unknown term, which is thus an
ideal penalty in the sense that it allows one to recover the oracle:

$$\arg\min_{Q \in \mathcal{Q}} L_S(s(Q)) + \mathrm{pen}_{id}(Q),$$

in which $s(Q)$ is a function mapping from input to target space under
hyperparameters $Q$, and the ideal penalty is as follows,

$$\mathrm{pen}_{id}(Q) = L(s(Q)) - L_S(s(Q)).$$

Hence, penalisation aims at mimicking the oracle by selecting, for a
known penalty function, the estimator

$$\arg\min_{Q \in \mathcal{Q}} L_S(s(Q)) + \mathrm{pen}(Q).$$

A good penalty in terms of prediction is one which gives an accurate
estimate of the ideal penalty $\mathrm{pen}_{id}$. The central idea of
V-fold penalties proposed in [2] is to directly estimate the ideal
penalty by a subsampled version of it. For some constant
$C_V \geq V - 1$, the V-fold penalty $\mathrm{pen}_{VF}(Q)$ is

$$\mathrm{pen}_{VF}(Q) = \frac{C_V}{V} \sum_{j=1}^{V} \left[ L_S\left(s^{(-j)}(Q)\right) - L_S^{(-j)}\left(s^{(-j)}(Q)\right) \right],$$

and so the corresponding selected hyperparameters are given by

$$\arg\min_{Q \in \mathcal{D}} L_S(s(Q)) + \mathrm{pen}_{VF}(Q),$$

where $\mathcal{D}$ is a discrete grid upon the set $\mathcal{Q}$. The
V-fold penalty indeed mimics the structure of the ideal penalty, in
such a way that the quantities related to the unknown law of data $P$
(respectively to the empirical measure $P_S$) are replaced by
quantities related to the empirical measure $P_S$ (respectively to the
subsampling measures $P_S^{(-j)}$), in the same analytic manner. The
design of the V-fold penalties is thus an adaptation of Efron's
resampling heuristics [15] to the subsampling scheme of the V-fold
procedure. It has been shown in [2], by considering the selection of
regressograms in a heteroscedastic regression framework, that V-fold
penalisation with $C_V = V - 1$ is asymptotically optimal for a fixed
V, whereas in this case VFCV is asymptotically suboptimal, due to its
bias in the estimation of the risk. Moreover, the choice of
$C_V = V - 1$ in the definition of the V-fold penalty corresponds to
Burman's corrected VFCV criterion [9]. Therefore, we use
$C_V = V - 1$ in our experiments, although we also explore different
values. Another advantage of V-fold penalisation compared to VFCV,
highlighted in [2], is that it seems to be more regular with respect
to the choice of V. At a heuristic level, this can be explained by
observing that since V-fold penalisation corrects the bias of VFCV, it
is only the variability of V-fold estimates that matters here, a
variability which is smaller for a larger V. Finally, it should be
noted that the constant $C_V$ in the definition of the V-fold penalty
can be viewed as a degree of freedom, which potentially allows one to
deal with the bias of the proposed risk estimation without varying the
value of V, contrary to VFCV where V alone fixes, simultaneously and
in an intricate manner, the bias and variance of the risk estimation.
In [2] it is shown that choosing $C_V$ to overpenalise (i.e.
$\mathrm{pen}_{VF}$ is larger than $\mathrm{pen}_{id}$ even in
expectation) can improve prediction performance when the signal to
noise ratio is small. The choice is a difficult one however, and
according to empirical results on synthetic datasets, it depends on
the sample size, noise level and smoothness of the regression
function.
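A minimal sketch of the V-fold penalty and the resulting penalised criterion is given below, with the mean absolute error as the loss. As an assumption for illustration, the learner is a polynomial least-squares fit whose degree stands in for the hyperparameter $Q$; the function names and the synthetic data are likewise hypothetical.

```python
import numpy as np


def mae(predict, X, y):
    """Mean absolute error of a fitted predictor on (X, y)."""
    return float(np.mean(np.abs(predict(X) - y)))


def vfold_penalty(fit, X, y, V, C_V, rng):
    """(C_V / V) * sum_j [ L_S(s^(-j)) - L_S^(-j)(s^(-j)) ]."""
    n = len(y)
    total = 0.0
    for block in np.array_split(rng.permutation(n), V):
        train = np.setdiff1d(np.arange(n), block)
        predict = fit(X[train], y[train])  # s^(-j), trained on S \ B_j
        # Loss on the whole sample minus loss on the training part.
        total += mae(predict, X, y) - mae(predict, X[train], y[train])
    return C_V * total / V


def penalised_criterion(fit, X, y, V, rng, C_V=None):
    """Empirical loss of the learner trained on all of S plus the V-fold penalty."""
    if C_V is None:
        C_V = V - 1  # the default choice discussed in the text
    return mae(fit(X, y), X, y) + vfold_penalty(fit, X, y, V, C_V, rng)


def poly_learner(degree):
    """Hypothetical stand-in learner: least-squares polynomial of a given degree."""
    def fit(x, y):
        coeffs = np.polyfit(x, y, degree)
        return lambda x_new: np.polyval(coeffs, x_new)
    return fit


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = x ** 2 + 0.05 * rng.standard_normal(60)
criteria = {d: penalised_criterion(poly_learner(d), x, y, V=5,
                                   rng=np.random.default_rng(1))
            for d in range(4)}
selected = min(criteria, key=criteria.get)
```

Passing a different `C_V` to `penalised_criterion` corresponds to over- or under-penalising without changing the number of folds.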



5 A Complexity-Based Selection of C_V



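In practice, as we shall later see, for a fixed training set and V,
the approximate penalty as given by $\mathrm{pen}_{VF}$ often poorly
approximates the ideal penalty and it cannot be improved by varying
the penalisation constant $C_V$. To study the cause of the problem we
analyse the $\mathrm{pen}_{VF}$ criterion relative to the ideal
penalty. Consider the first term in the sum of $\mathrm{pen}_{VF}$,
$L_S(s^{(-j)}(Q))$, and write it in terms of the loss on the training
and test set:

$$L_S\left(s^{(-j)}(Q)\right) = \frac{1}{n} \sum_{i \in S \setminus B_j} \ell_i\left(s^{(-j)}(Q)\right) + \frac{1}{n} \sum_{i \in B_j} \ell_i\left(s^{(-j)}(Q)\right)$$
$$= \frac{V-1}{V} L_S^{(-j)}\left(s^{(-j)}(Q)\right) + \frac{1}{V} L_S^{(j)}\left(s^{(-j)}(Q)\right),$$

in which $\ell_i(\cdot)$ is the loss for the ith example. The link
between the lines can be seen by noting that

$$L_S^{(-j)}\left(s^{(-j)}(Q)\right) = \frac{V}{n(V-1)} \sum_{i \in S \setminus B_j} \ell_i\left(s^{(-j)}(Q)\right).$$

When we put the above form of $L_S(s^{(-j)}(Q))$ into
$\mathrm{pen}_{VF}$ we obtain

$$\mathrm{pen}_{VF}(Q) = \frac{C_V}{V} \left[ \frac{1}{V} \sum_{j=1}^{V} \left( L_S^{(j)}\left(s^{(-j)}(Q)\right) - L_S^{(-j)}\left(s^{(-j)}(Q)\right) \right) \right],$$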



and the term inside the square brackets is the empirical expectation
of the error on the test set minus the error on the training set. One
can say that this is an approximation of the ideal penalty using
$(V - 1)n/V$ examples since the loss term on the right side is
computed over $S \setminus B_j$. A variety of error bounds have the
penalty proportional to a complexity measure and inversely
proportional to the number of examples to the power of a learning rate
$\beta(Q)$ (see [25] for example). In other words, for a learner $s$
with parameters $Q$ we consider the following form of the penalty:

$$\mathrm{pen}_V(Q) = \frac{C_V}{V} \frac{D(Q)}{(n(V-1)/V)^{\beta(Q)}}, \qquad (2)$$

where $D(Q)$ is the complexity of $s(Q)$ and $\beta(Q)$ is a learning
rate, and we have replaced the square bracketed term in
$\mathrm{pen}_{VF}$ with $D(Q)/(n(V-1)/V)^{\beta(Q)}$. A learning rate
of 0 implies a large penalty and that we have overfitted the data and
hence, for a fixed V and sample size, we do not learn anything (in
other words one predicts on a test set randomly). As the sample size
increases one continues to overfit and hence the penalty term is
$C_V D(Q)/V$ regardless of the sample size. In contrast, when
$\beta(Q) = 1$ the penalty is small, and rapidly decreases with the
sample size, and also with V. A limit of 1 for $\beta(Q)$ is natural
for the learning rate since this is the bound often used in complexity
bounds. The ideal penalty has the form $D(Q)/n^{\beta(Q)}$ for n
examples and hence we would like to choose $C_V$ above so that

$$\frac{C_V}{V} \frac{D(Q)}{(n(V-1)/V)^{\beta(Q)}} = \frac{D(Q)}{n^{\beta(Q)}},$$

and solving gives $C_V = (V-1)^{\beta(Q)}/V^{\beta(Q)-1}$. A learning
rate of 0, which occurs with complex models (e.g. large decision
trees), implies $C_V = V$ and similarly for small models where
$\beta(Q) = 1$ we have $C_V = V - 1$ as suggested asymptotically
above. In this latter case we recover exactly the value of $C_V$
suggested in [2]. On the whole, $\mathrm{pen}_V(Q)$ is an estimation of
the ideal penalty using $\mathrm{pen}_{VF}$ and the model complexity
$D(Q)$. It remains to consider how one computes the learning rate. We
equate Eq. (2) with $\mathrm{pen}_{VF}$ and then taking logs results in
$\log(\mathrm{pen}_{VF}(Q)/C_V)$ equal to

$$-\log(V) - \beta(Q) \log(n(V-1)/V) + \log(D(Q)).$$

One finds the gradient of $\log(\mathrm{pen}_{VF}(Q)/C_V) + \log(V)$
versus $\log(n(V-1)/V)$ for a selection of different V values whilst
fixing $Q$ and $n$, in order to find the learning rate $\beta(Q)$ (the
gradient equals $-\beta(Q)$).
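The two computations of this section, the constant $C_V$ as a function of the learning rate and the estimation of $\beta(Q)$ from a log-log regression, can be sketched as follows. The function names are hypothetical, the input values of $\mathrm{pen}_{VF}(Q)/C_V$ are assumed to be positive so that the logarithm is defined, and the synthetic check simply generates penalties that follow Eq. (2) exactly.

```python
import numpy as np


def c_v_from_rate(V, beta):
    """C_V = (V - 1)^beta / V^(beta - 1), obtained by solving for C_V."""
    return (V - 1) ** beta / V ** (beta - 1)


def estimate_rate(V_values, pen_over_c, n):
    """Estimate beta(Q) from penalties computed with several values of V.

    Regresses log(pen_VF / C_V) + log(V) on log(n (V - 1) / V); the slope
    is -beta(Q). The estimate is clipped to [0, 1].
    """
    V_values = np.asarray(V_values, dtype=float)
    log_m = np.log(n * (V_values - 1) / V_values)
    target = np.log(np.asarray(pen_over_c, dtype=float)) + np.log(V_values)
    slope, _intercept = np.polyfit(log_m, target, 1)
    return float(np.clip(-slope, 0.0, 1.0))


# Synthetic check: penalties that follow Eq. (2) with beta = 0.5 and D = 3.
n, beta_true, D = 200, 0.5, 3.0
V_values = np.arange(2, 13)
pen_over_c = D / (V_values * (n * (V_values - 1) / V_values) ** beta_true)
beta_hat = estimate_rate(V_values, pen_over_c, n)
C_V = c_v_from_rate(5, beta_hat)
```

At the two extremes the formula gives `c_v_from_rate(V, 0.0) == V` and `c_v_from_rate(V, 1.0) == V - 1`, matching the limiting cases discussed above.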

6 Experimental Setup 

We study the behaviour of VFCV and V-fold pe- 
nalisation on a collection of benchmark datasets. 
The scikit-learn library in Python [30] is used to 
generate the output of the RBF SVR and CART 
algorithms. 

In total, 10 datasets from the UCI machine learning repository [16]
and DELVE [28] are used. Each dataset is split into 100 training and
test realisations/sets after being processed so that the examples and
labels have zero mean and unit standard deviation. Details are
provided in Table 1. When comparing model selection algorithms, one
method is considered a statistically significant improvement over
another if its mean error is lower and the difference is significant
according to a paired t-test. For the t-test we take the sample of
errors over all realisations for the 2 methods, then compute a p-value
and reject the null hypothesis, that the means are equal, if p < 0.1.
In all experiments we use the mean absolute error, i.e. for a
prediction function $f : \mathcal{X} \to \mathcal{Y}$, the error is

$$\frac{1}{n} \sum_{i=1}^{n} \|f(\mathbf{x}_i) - y_i\|_1.$$
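The significance rule described above can be sketched with `scipy.stats.ttest_rel`, which performs a paired t-test on two samples of matched observations. The helper name `significantly_better` and the synthetic per-realisation errors are assumptions made for the example.

```python
import numpy as np
from scipy.stats import ttest_rel


def significantly_better(errors_a, errors_b, level=0.1):
    """True if method A has a lower mean error than B and the paired
    t-test rejects the hypothesis of equal means at the given level."""
    errors_a = np.asarray(errors_a, dtype=float)
    errors_b = np.asarray(errors_b, dtype=float)
    result = ttest_rel(errors_a, errors_b)
    return bool(errors_a.mean() < errors_b.mean() and result.pvalue < level)


# Hypothetical errors over 100 realisations: method B is worse by about 0.05.
rng = np.random.default_rng(0)
base = rng.uniform(0.5, 0.7, 100)
errors_a = base + 0.01 * rng.standard_normal(100)
errors_b = base + 0.05 + 0.01 * rng.standard_normal(100)

a_beats_b = significantly_better(errors_a, errors_b)
b_beats_a = significantly_better(errors_b, errors_a)
```

The test is paired because both methods are evaluated on the same 100 realisations, so the per-realisation differences are what carry the information.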



Dataset Learn Test d Abrv. 



Table 1: Information on the benchmark datasets 
used. There are 100 learn/test splits for each 
dataset. 

6.1 Model Selection 

In all of the following experiments we use a grid to approximate the
set of hyperparameters. The SVR penalty is chosen as
$C \in \{2^{-10}, 2^{-8}, \ldots, 2^{12}\}$, the kernel width as
$\gamma \in \{2^{-10}, 2^{-8}, \ldots, 2^{2}\}$, and
$\epsilon \in \{2^{-4}, 2^{-3}\}$. More sophisticated ways of searching
in the hyperparameter space exist, such as the iterative process
derived from the so-called active sets method and used in [20, 12, 29]
to walk along the entire path of the SVR: $\gamma$ is fixed and all
values of $C$ are considered. Other heuristics involve e.g. genetic
algorithms [22] or local search methods [27]. However, some
degeneracies can occur with these methods and so a search on a grid
should be more stable. Moreover, the grid search also has the
advantage of being easily parallelised, because each value of
$(\gamma, C, \epsilon)$ is independent from the others. In the case of
CART regression, we pick the bound on the tree size $t$ from
$\{2^{1} - 1, \mathrm{round}(2^{1.5} - 1), \ldots, 2^{7} - 1\}$.
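The grids above can be generated as in the following sketch. The variable names are hypothetical, and the tree-size grid assumes that the exponent runs from 1 to 7 in steps of 0.5, which is one reading of the set given in the text.

```python
import itertools

# SVR grids: exponents step by 2 for C and gamma, two values for epsilon.
C_grid = [2.0 ** k for k in range(-10, 13, 2)]
gamma_grid = [2.0 ** k for k in range(-10, 3, 2)]
epsilon_grid = [2.0 ** -4, 2.0 ** -3]

# Each (gamma, C, epsilon) triple is an independent job, so the grid
# can be evaluated in parallel.
svr_grid = list(itertools.product(gamma_grid, C_grid, epsilon_grid))

# CART grid: bounds on the tree size, assuming exponents 1, 1.5, ..., 7.
tree_sizes = [round(2 ** (k / 2) - 1) for k in range(2, 15)]
```

Under these assumptions the SVR grid contains 168 triples and the tree-size grid 13 values, from 1 up to 127.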

An important characterisation of the model picked during the selection
phase is its complexity. In all of the model selection techniques we
choose a set of parameters over $n(V-1)/V$ examples, however the final
predictor is trained using all $n$ examples. Ideally, we would like
the complexity to be identical both in model selection and when
training using all examples, since the penalty is a function of
complexity (Equation 2). For the SVR the norm of the weight vector is
the measure of complexity used in error bounds (see [35]). For this
reason, we compute the mean norm of the SVR weight vector
$\|\mathbf{w}\|$ for each value of $C, \gamma, \epsilon$ used in model
selection and store $\|\mathbf{w}^*\|$, the norm corresponding to the
lowest error, as well as the corresponding values
$\gamma^*, \epsilon^*$. When we train using all $n$ examples, we again
compute the norms $\|\mathbf{w}\|$ corresponding to each value of $C$,
for $\gamma^*, \epsilon^*$, and choose the $C$ with corresponding norm
closest to $\|\mathbf{w}^*\|$. In the case of CART trees, we can be
slightly more direct: for the optimal bound on the tree size, $t^*$,
we compute the real mean tree size found during model selection and
round to the nearest integer $t$. The value of $t$ is then used to
train over all $n$ examples.



6.2 Primary Setup 

In order to test the model selection techniques we take random
training subsamples of either 50, 100 or 200 examples of the learning
sets to observe model selection on a limited number of examples.
Furthermore, we test using 2, 4, ..., 12 folds. Model selection is
performed using each subsample and then SVR or CART is trained using
the optimal parameters and the entire subsample. This is repeated for
each realisation and results are averaged over the entire set of
realisations. As well as recording the error obtained using model
selection over the realisations, we also store the difference between
the "ideal" and approximated penalty. This latter quantity is computed
simply as the difference between the V-fold penalty and the penalty as
computed using the test set. All results for V-fold penalisation are
evaluated with $C_V = (V - 1)\alpha$ with
$\alpha \in \{0.6, 1.2, \ldots, 1.6\}$ being the multiplier for the
penalisation. We denote the types of model selection methods as: VFCV,
V-fold penalisation (PenVF) and V-fold penalisation using a learning
rate (PenVF+).

For the PenVF+ method one needs to compute learning rates for each
model (set of parameters). We use the same training sets as above and
vary V over the set {2, 3, ..., 12}. The quantities
$\log(\mathrm{pen}_V(Q)) + \log(V)$ versus $\log(n(V-1)/V)$ are
computed and the gradient, found using linear regression, provides
$\beta(Q)$, which in turn is used to calculate
$C_V = (V-1)^{\beta(Q)}/V^{\beta(Q)-1}$. As this estimation of
$\beta(Q)$ can be unstable, especially with small training sets, we
clip its value to lie within the valid range [0, 1].



7 Experimental Results 

7.1 Comparison of Penalisation and 
VFCV with V = 2 

We start by studying the SVR results in Table 2, which shows errors
for all datasets when V = 2, $\alpha$ = 1.0. We consider V = 2 in this
case since it provides the greatest distinction between the model
selection methods. For PenVF+ many of the results are comparable to
VFCV when we also consider the standard deviations. At the same time,
PenVF+ does not always improve upon PenVF. In contrast, PenVF can
perform significantly worse than both VFCV and PenVF+, for example
with abalone, winequality-red and winequality-white. Also of note is
that the difference in error between VFCV and PenVF does not improve
with m = 200, with abalone for example: it is 0.08, 0.089, 0.089 with
m = 50, 100, 200.

Also shown at the bottom of Table 2 are the equivalent CART results.
It is evident that error rates are generally worse than with the SVR,
with the exception of pumadyn-32nh. Also, we see that penalisation
provides a larger advantage relative to VFCV in this case. One
explanation is that CART is more sensitive to its hyperparameters. We
observe that PenVF+ is equivalent to or improves over VFCV in nearly
every case, and there are 5, 7, 7 wins for m = 50, 100, 200
respectively. Again we see that PenVF performs poorly with abalone,
winequality-red and winequality-white.

7.2 Paired t-test Comparison with 
VFCV 

Table 3 shows the results of the paired t-tests to 
compare PenVF with a = 1.0 and PenVF+ with 
VFCV. Consider hrst the SVR results. Here we 
see that as one might expect, there are few sta- 
tistically significant differences between 10-fold CV 
and PenVF. Indeed, results indicate that as V in- 
creases the penalisation methods and VFCV be- 
come more similar. In particular we see that 
PenVF+ is identical to VFCV in all but one or two 
cases, with the main exception being 5 draws with 
m = 200 and V = 2, in which there are 2 im- 
provements and 3 losses (abalone, pumadyn-32nh 
and winequality-red). PenVF fares worse against 
VFCV: we see more wins but at the same time more
losses. Our later analysis will shed light on why this
is the case. We also compared the "ideal" model,
in which the test set is used during model selec-
tion, with VFCV and found that one can generally
gain improvements except in the case of slice-loc,
which has a large number of features. In ev-
ery case the ideal model selector can improve over
2-fold CV.

Table 2: Error rates (with standard deviations in parentheses) for cross validation with the SVR (top)
and CART (bottom) and penalisation for V = 2, a = 1.0. Statistically significant improvements over
VFCV are in bold. With the SVR, PenVF+ is generally comparable to VFCV and PenVF is more variable.
With CART, the penalisation methods both improve over VFCV a number of times.

The CART results at the bottom of Table 3 show 
that for V > 4 penalisation is identical to VFCV. 
Clearly the bias with low values of V in conjunc- 
tion with VFCV is more prominent in this case. 
We have already examined the V = 2 case and
with V = 4 we see improvements in one case for
PenVF+ for m = 50, 200. With the ideal errors
we see that, for more of the datasets than with
the SVR, performance cannot be improved by using
the test realisation. Furthermore, as m increases,
improvements over VFCV are increasingly difficult
except for the 2-fold case.
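
The win/draw/loss counts reported from the paired t-tests can be illustrated with a minimal sketch. It assumes that each method is evaluated on the same repeated realisations of a dataset, so that the errors are paired. The function names, the error values and the choice of 10 repetitions with a two-sided 5% level are illustrative assumptions, not details taken from the experiments.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """t statistic of the paired differences errors_a - errors_b."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0.0:
        # All differences are identical: either no difference at all,
        # or a constant shift, which we treat as infinitely significant.
        m = mean(diffs)
        return 0.0 if m == 0.0 else math.copysign(math.inf, m)
    return mean(diffs) / (sd / math.sqrt(n))


def compare(errors_method, errors_vfcv, t_crit):
    """Classify one dataset as a 'win', 'draw' or 'loss' for the method."""
    t = paired_t_statistic(errors_method, errors_vfcv)
    if t < -t_crit:
        return "win"    # method has significantly lower error than VFCV
    if t > t_crit:
        return "loss"   # method has significantly higher error than VFCV
    return "draw"


# Hypothetical test errors over 10 repetitions (not taken from the paper).
vfcv = [0.70, 0.72, 0.69, 0.71, 0.73, 0.70, 0.68, 0.72, 0.71, 0.70]
pen = [0.66, 0.69, 0.66, 0.67, 0.70, 0.66, 0.65, 0.68, 0.68, 0.66]

# Two-sided 5% critical value of Student's t with 9 degrees of freedom
# (assumed significance level).
T_CRIT = 2.262
print(compare(pen, vfcv, T_CRIT))  # prints "win"
```

Counting the outcomes of `compare` over all datasets, for a fixed m and V, would give one cell of losses, draws and wins in a summary of this kind.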



7.3 Optimal Penalisation Constant 
is Dataset Dependent 

To discover the effect of overpenalisation, observe
Figure 2 which shows the errors on 2 datasets as 
a varies when V = 10. The effect of a is clearly 
dataset dependent: on abalone observe that the 
error tends to decrease with the SVR with more pe- 
nalisation, and hence a slight amount of overpenali- 
sation (a = 1.2) is recommended. In contrast, over- 
penalisation increases the error with slice-loc. 
In total, 5 of the datasets benefited from overpe- 
nalisation and 3 improved with underpenalisation. 
With pumadyn and parkinsons-total, a value of 
a = 1.0 as predicted by the theory gave optimal 
results. Results were similar with CART except 
that 7 of the datasets benefit from underpenalisa- 
tion. Notice that as sample size m increases the 
choice of a becomes less important since estimates 
of error for each model are more reliable and the 
penalty terms become small.

Table 3: Number of statistically significant losses (left block), draws (middle block) and wins (right
block) against standard errors for CV and across different numbers of folds using a = 1.0. The sample
size m is 50 (top), 100 (middle), and 200 (bottom). As V increases the penalisation methods become
more similar to VFCV. In particular for CART, when V > 4 penalisation is identical to VFCV.

We noticed that with CART, VFCV consistently
underestimates the tree size whereas PenVF chooses
larger sizes in general, and this can be an advantage
or disadvantage depending on the variation of error
with t. We also observed that, as expected, VFCV
provides a pessimistic error estimate compared to
the ideal case and PenVF is generally more accurate
than VFCV. However, since model selection picks
the model with the lowest estimated error, a more
accurate error estimate does not always translate
into the best predictor.
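
To make the role of the constant a concrete, the following minimal sketch assumes that a multiplies the estimated penalty, so that each model is scored by its empirical error plus a times its penalty. The function name and all numerical values are hypothetical and are chosen only to show that the selected model can change with a.

```python
def select_model(emp_errors, penalties, alpha):
    """Index of the model minimising empirical error + alpha * penalty."""
    scores = [e + alpha * p for e, p in zip(emp_errors, penalties)]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical values for five models of increasing complexity.
emp_errors = [0.90, 0.60, 0.45, 0.38, 0.35]  # empirical error falls with complexity
penalties = [0.02, 0.10, 0.22, 0.32, 0.40]   # estimated penalty rises with complexity

for alpha in (0.5, 1.0, 2.0):
    print(alpha, select_model(emp_errors, penalties, alpha))
```

With these made-up numbers, underpenalisation (a = 0.5) selects the fourth model, a = 1.0 the third, and strong overpenalisation (a = 2.0) the second: a larger constant shifts the choice towards simpler models, which helps only when the larger models really do generalise worse.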

7.4 On the Estimation of the Ideal 
Penalty 

Next we study the approximated penalty and how 
it differs from the "ideal" penalty in the case of 
CART with V = 2, see Figure 3. Notice that 
the curves for PenVF and PenVF+ are shorter than 
the ones for the "ideal" penalty since only half the 
examples are used for training, limiting the tree 
size. The PenVF method diverges from the ideal 
case when we grow large trees, but is close to the
ideal case for small trees. This change occurs with
relatively small trees: size 4 with m = 50 and
size 11 with m = 100. This pattern was observed 
with most of the datasets. When we looked at a 
greater number of folds, PenVF was close to the 
ideal penalty. In contrast, PenVF+ does not di- 
verge as the tree size increases, however it seems 
to slightly overestimate the penalty. 

The question remains in which cases PenVF
improves over VFCV for low values of V for CART.
Figure 4 demonstrates that the error estimation
for PenVF is generally optimistic as model size in-
creases. With abalone for example the optimal
tree had 7 nodes, however PenVF chooses
one with 22. In contrast, on some datasets large
trees did not overfit the test set and hence in these
cases PenVF can perform better than VFCV. This
also sheds some light on Figure 2 where we see that
overpenalisation helps in some cases but not in oth-
ers. PenVF+ provides the best estimate of the error
in general, however it also results in a larger tree
size than the ideal case.

7.5 Extended Setup

In this final set of experiments we explore further 
the distinction between penalisation and VFCV 
with CART by considering m = 500 and using the 
same set of folds as in the original setup, see Ta- 
ble 4. The interesting figures in this case are those 
corresponding to 2 and 4 folds in which we see that 
PenVF+ wins for 8 and 2 datasets respectively. In 
fact, since in the ideal case one can only win 8 times 
with V = 2 this certainly demonstrates the effec- 
tiveness of PenVF+ and bias in VFCV in this case. 

7.6 Key Points 

Our experimental analysis has painted a detailed 
picture of penalisation versus cross validation for 
model selection. The bias in VFCV is evident 
with small values of V and small training sets, 
and we observed that as V and the training set 
sizes increase the model selection methods become 
more similar. With the SVR, PenVF incurs a num-
ber of losses relative to VFCV and these losses are
nearly all corrected with our modified penalisation 
PenVF+. Penalisation is more effective in general 
with CART: when V > 4 both PenVF and PenVF+ 
are not statistically significantly different to VFCV, 
and for V = 2, PenVF+ is at least as good as VFCV 
or improves over it in nearly every case, winning 
5, 7, 7 times for m = 50, 100, 200 examples. In 
contrast PenVF is more variable in comparison to 
VFCV and one reason for this is that it underes- 
timates the penalty to a large degree with large 
models. On some datasets larger trees did not in- 
crease the error and hence in these cases PenVF per- 
forms well. In general PenVF+ provides a much bet- 
ter approximation of the ideal penalty compared to 
PenVF. The most striking results were with CART 
and V = 2 in which we saw that PenVF+ improves 
over VFCV in 8 out of 10 cases with m = 500 ex- 
amples. 

8 Conclusions 

Model selection is a critical part of machine learn- 
ing as it can dramatically affect generalisation per- 
formance. In practice, cross validation over a grid 
of parameter values is often used, and it has been 
shown to be very effective in a variety of cases. We
studied V-fold penalisation which is a general pur-
pose penalisation procedure that aims at improv-
ing on VFCV by correcting its bias and is proved
in [2] to be asymptotically optimal in a histogram 
regression setting. V-fold penalisation is simple to 
implement and the penalised error can be computed 
using the same predictions as cross validation and 
hence at negligible additional computational cost. 
Furthermore, we propose an improvement of pe- 
nalisation, called PenVF+, which takes into account 
learning rates in order to correct under-penalisation 
with large models. 
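
As a rough illustration of why the penalised error reuses the models already trained for cross validation, the sketch below computes both criteria for a toy histogram regressor. It assumes the V-fold penalty takes the form a(V - 1)/V times the sum over folds of (error of the fold model on all points minus its error on its own training points), added to the empirical error of the model trained on all points. That form, the regressor, the synthetic data and all function names are assumptions made for illustration and are not taken from the experiments reported here.

```python
import math
import random


def fit_histogram(xs, ys, bins):
    """Piecewise-constant regressor on [0, 1) with `bins` equal-width cells."""
    sums, counts = [0.0] * bins, [0] * bins
    for x, y in zip(xs, ys):
        b = min(int(x * bins), bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Empty cells fall back to the overall training mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * bins), bins - 1)]


def mse(predict, xs, ys):
    """Mean squared error of `predict` on the given points."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def vfcv_and_penvf(xs, ys, bins, V, alpha=1.0):
    """Return (VFCV error, V-fold penalised error) for one model."""
    n = len(xs)
    folds = [list(range(j, n, V)) for j in range(V)]  # interleaved folds
    cv_err, pen = 0.0, 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        f = fit_histogram(tx, ty, bins)  # the same model VFCV trains
        cv_err += mse(f, [xs[i] for i in held_out], [ys[i] for i in held_out]) / V
        # Error on all points minus error on the fold's own training points.
        pen += mse(f, xs, ys) - mse(f, tx, ty)
    pen *= alpha * (V - 1) / V  # assumed form of the V-fold penalty
    full = fit_histogram(xs, ys, bins)
    return cv_err, mse(full, xs, ys) + pen


# Synthetic data (illustrative only): noisy sine on [0, 1).
rng = random.Random(0)
xs = [rng.random() for _ in range(100)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]

for bins in (1, 2, 4, 8, 16):
    cv, pv = vfcv_and_penvf(xs, ys, bins, V=2)
    print(bins, round(cv, 3), round(pv, 3))
```

In this sketch the only work beyond cross validation is one extra fit on the full sample and evaluating each fold model on its own training points, which is cheap relative to training.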

We conducted an extensive empirical investiga- 
tion into VFCV and V-fold penalisation over a col- 
lection of 10 well known benchmark datasets using 
an SVR with the RBF kernel and CART. With 
low values of V, penalisation can provide an ad- 
vantage over VFCV but this advantage rapidly di- 
minishes as V increases. Furthermore, in some 
cases penalisation fared worse than cross validation. 
When we compared the penalty with the "ideal"
penalty, we observed that PenVF underestimates
the penalty with large models and PenVF+ improves
penalty estimation in these cases. Hence, there is
no fixed overpenalisation constant even for a par-
ticular dataset, but rather the penalisation should
vary with model complexity as with PenVF+.

(a) SVR, abalone (top) and slice-loc (bottom)
(b) CART, winequality-white (top) and
slice-loc (bottom)

Figure 2: The variation in error with a for some
sample datasets with V = 10. The plots are or-
dered from top to bottom: m = 50, 100, 200, for
example for winequality-white and CART the re-
spective curves are dotted, solid with crosses and
solid with pluses. With abalone and the SVR, and
winequality-white with CART overpenalisation
improves results, however it makes them worse with
slice-loc for both the SVR and CART.




(a) m = 50 




Figure 3: The variation in penalty for CART in 
the "ideal" case relative to PenVF a = 1.0, and 
PenVF+ with V = 2 and abalone. PenVF+ es-
timates the ideal error well across a range of t's,
whereas PenVF underestimates it for large t. 




Figure 4: The error for CART in the "ideal" case
relative to PenVF a = 1.0, and PenVF+ with V = 2
and abalone, m = 50. PenVF provides a poor es-
timation of the error for large values of t. PenVF+
gives better error estimates but results in the selec-
tion of larger trees than VFCV.